Abstract: Approximate dynamic programming has been investigated and used as a tool to approximately solve optimal regulation problems. However, the extension of this technique to optimal tracking problems for continuous time nonlinear systems has remained a non-trivial open problem. The control development in this paper guarantees uniformly ultimately bounded (UUB) tracking of a desired trajectory, while also asymptotically converging to an approximate optimal policy.

I. Introduction 

Reinforcement learning (RL) is a concept that can be used 
to enable an agent to learn optimal policies from its interaction 
with the environment. The objective of the agent is to learn 
the policy that maximizes or minimizes a cumulative long term 
reward. Almost all RL algorithms use some form of general- 
ized policy iteration (GPI). GPI is a set of two simultaneous 
interacting processes, policy evaluation (also referred to as 
the critic) and policy improvement (also referred to as the 
actor). Starting with an estimate of the state cost function and 
an admissible policy, the critic makes the estimate consistent 
with the policy and the actor makes the policy greedy with 
respect to the cost function. These algorithms exploit the fact 
that the optimal cost function satisfies Bellman's principle of 
optimality [1], [2].

The principle of optimality leads to a wide range of 
algorithms that focus on finding solutions to the Bellman 
equation (BE) or approximations of the BE. For discrete time 
systems, BE-based policy evaluation methods do not require a 
model of the environment, and hence, have been central to the 
development of RL [2]. Approximate dynamic programming
(ADP) consists of algorithms that facilitate the solution of 
the approximate BE for problems with a continuous state 
space or an infinite discrete state space by utilizing a function 
approximation structure to approximate the state cost function 
[3].

The iterative nature of ADP has made it an attractive
method for solving discrete time optimal regulation problems. 
Within ADP literature, neural networks (NNs) are the most 
popular tool for cost function approximation [4]-[9]. For
continuous time systems, solutions to the optimal regulation 

Rushikesh Kamalapurkar, Huyen Dinh, and Warren Dixon are with the 
Department of Mechanical and Aerospace Engineering, University of Florida, 
Gainesville, FL, USA. Email: {rkamalapurkar, huyentdinh, wdixon}@ufl.edu.

Shubhendu Bhasin is with the Department of Electrical Engineering, Indian 
Institute of Technology, Delhi, India, email: sbhasin@ee.iitd.ac.in. 

This research is supported in part by NSF award numbers 0547448, 
0901491, 1161260, and 1217908. Any opinions, findings and conclusions or 
recommendations expressed in this material are those of the author(s) and do 
not necessarily reflect the views of the sponsoring agency. 



problem using ADP have been proposed using discretization 
of the dynamical system, but these solutions become com- 
putationally prohibitive as the dimensionality of the problem 
increases. Advantage updating was proposed in [10] as a Q-learning
method for continuous time systems, but it lacks an
accompanying stability analysis. 

When applied to continuous time systems the principle 
of optimality leads to the Hamilton-Jacobi-Bellman (HJB) 
equation which is the continuous time counterpart of the BE 
[11]. Similar to discrete time ADP, continuous time ADP
approaches aim at finding approximate solutions to the HJB 
equation. Various methods to solve this problem are proposed 
in [12]-[18] and the references therein.

An infinite horizon regulation problem with a quadratic cost 
function is the most common problem considered in ADP liter- 
ature. For these problems, function approximation techniques 
can be used to approximate the cost function because it is 
a time invariant function. In tracking problems, the tracking 
error, and hence the cost function, is a function of the state and 
an explicit function of time. Approximation techniques like 
NNs are commonly used in ADP literature for cost function 
approximation. However, NNs can only approximate functions 
on compact domains, thus leading to a technical challenge to 
approximate the cost function for a tracking problem because 
the infinite horizon nature of the problem implies that time 
does not lie on a compact set. 

For discrete time systems, several approaches have been 
developed to address the tracking problem. Park et al. [19]
use generalized backpropagation through time to solve a finite 
horizon tracking problem that involves offline training of NNs. 
An ADP-based approach is presented in [20] to solve an
infinite horizon optimal tracking problem where the desired 
trajectory is assumed to depend on the system states. A greedy 
heuristic dynamic programming based algorithm is presented 
in [21] which uses a system transformation to transform the
nonautonomous system into an autonomous system. However, 
this result lacks an accompanying stability analysis. 

ADP-based approaches are presented in [22], [23] for continuous
time systems. In both the results, the cost function
(i.e. the critic), and the controller (i.e. the actor) presented 
are time-varying functions of the tracking error. However, 
as the problem being solved is an infinite horizon optimal 
control problem, time does not lie on a compact set. NNs can 
only approximate functions on a compact domain. Thus, it is 
unclear how a NN that only takes the tracking error as an input 
can approximate the time varying cost function and controller. 

This paper presents an approach to solve the continuous-time optimal tracking problem online using a system transformation to convert the problem into an optimal regulation problem in such a way that the resulting cost function is a stationary function of the transformed states, and hence, lends itself to approximation using a NN. The desired trajectory is assumed to be the output of a nonlinear dynamical system. A Lyapunov-based analysis is used to prove uniformly ultimately bounded (UUB) tracking and convergence to the approximate optimal control.

II. Formulation of stationary optimal control problem

Consider a class of nonlinear control affine systems

\[ \dot{x} = f + g u, \]

where $x(t) \in \chi \subset \mathbb{R}^n$ is the state, and $u(x(t),t) \in U \subset \mathbb{R}^m$ is the control input. The functions $f(x(t)): \chi \to \mathbb{R}^n$ and $g(x(t)): \chi \to \mathbb{R}^{n \times m}$ are Lipschitz continuous functions on $\chi$ where $f(0) = 0$, and the solution of the system is unique for any finite initial condition $x_0 \in \chi$ and control $u(x(t),t) \in U$. The control objective is to track a bounded continuously differentiable signal $x_d(t): \mathbb{R}_+ \to \chi$. To quantify this objective, a tracking error is defined as $e(x(t), x_d(t)) = x(t) - x_d(t)$. The open-loop tracking error dynamics can then be written as

\[ \dot{e} = f + g u - \dot{x}_d. \tag{1} \]



The following assumptions are made to facilitate the formulation of an approximate optimal tracking controller.

Assumption 1. The function $g(x(t))$ is bounded and has full column rank, and the function $g^+(x(t)): \chi \to \mathbb{R}^{m \times n}$ defined as $g^+(x(t)) = \left(g^T(x(t))\, g(x(t))\right)^{-1} g^T(x(t))$ is bounded and Lipschitz continuous.

Note that $g^+(x(t))\, g(x(t)) = I_{m \times m}$, where $I_{m \times m} \in \mathbb{R}^{m \times m}$ denotes the identity matrix.
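As a quick numerical illustration of Assumption 1, the sketch below forms the left pseudoinverse of a full-column-rank input matrix and checks the identity noted above. The 3x2 matrix `g_example` and the helper name `left_pseudoinverse` are assumptions made only for this illustration; they do not come from the paper.

```python
import numpy as np

def left_pseudoinverse(g):
    """Return g^+ = (g^T g)^{-1} g^T for a full-column-rank matrix g."""
    # Solving (g^T g) X = g^T avoids forming the explicit inverse.
    return np.linalg.solve(g.T @ g, g.T)

# Hypothetical input matrix with n = 3 states and m = 2 inputs.
g_example = np.array([[1.0, 0.0],
                      [0.0, 2.0],
                      [1.0, 1.0]])

g_plus = left_pseudoinverse(g_example)   # shape (m, n) = (2, 3)
identity_check = g_plus @ g_example      # should equal the 2x2 identity
```

Note that only $g^+ g = I_{m \times m}$ holds in general; the product $g g^+$ is a projection onto the range of $g$ and equals the identity only when $g$ is square.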

Assumption 2. There exists a Lipschitz continuous function $h_d(x_d(t)): \chi \to \mathbb{R}^n$ such that $\dot{x}_d = h_d(x_d(t))$ and $h_d(0) = 0$.

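The steady state control input $u_d(x_d(t))$ corresponding to the desired trajectory $x_d(t)$ is

\[ u_d = g_d^+ (h_d - f_d), \tag{2} \]

where the signals $g_d^+(t) \in \mathbb{R}^{m \times n}$ and $f_d(t) \in \mathbb{R}^n$ are defined as $g_d^+(t) = g^+(x_d(t))$ and $f_d(t) = f(x_d(t))$, respectively. To transform the non-stationary optimal control problem into a stationary optimal control problem, a new concatenated state $\zeta(t) \in \chi \times \chi \subset \mathbb{R}^{2n}$ is defined as [21]

\[ \zeta = \begin{bmatrix} e^T & x_d^T \end{bmatrix}^T. \tag{3} \]

Based on (1) and Assumption 2, the time derivative of (3) can be expressed as

\[ \dot{\zeta} = F + G\mu, \tag{4} \]

where the functions $F(\zeta(t)): \chi \times \chi \to \mathbb{R}^{2n}$, $G(\zeta(t)): \chi \times \chi \to \mathbb{R}^{2n \times m}$, and the policy $\mu(\zeta(t)): \chi \times \chi \to \mathbb{R}^m$ are defined as

\[ F(\zeta) = \begin{bmatrix} f(e + x_d) - h_d(x_d) + g(e + x_d)\, u_d(x_d) \\ h_d(x_d) \end{bmatrix}, \quad G(\zeta) = \begin{bmatrix} g(e + x_d) \\ 0_{n \times m} \end{bmatrix}, \quad \mu = u - u_d. \tag{5} \]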
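The transformation in (3)-(5) can be checked numerically. The sketch below uses a hypothetical two-state plant and a harmonic-oscillator trajectory generator (both are assumptions chosen for illustration, as are all function names) and verifies that $F + G\mu$ reproduces the error dynamics in (1) stacked on top of $\dot{x}_d = h_d(x_d)$.

```python
import numpy as np

# Hypothetical plant (n = 2, m = 1) and trajectory generator; illustration only.
def f(x):
    return np.array([x[1], -x[0] - x[1]])      # drift with f(0) = 0

def g(x):
    return np.array([[0.0], [1.0]])            # constant, full column rank

def h_d(xd):
    return np.array([xd[1], -xd[0]])           # harmonic oscillator, h_d(0) = 0

def g_plus(x):
    gx = g(x)
    return np.linalg.solve(gx.T @ gx, gx.T)    # left pseudoinverse of g

def u_d(xd):
    return g_plus(xd) @ (h_d(xd) - f(xd))      # steady state input, eq. (2)

def F(zeta):
    n = zeta.size // 2
    e, xd = zeta[:n], zeta[n:]
    top = f(e + xd) - h_d(xd) + g(e + xd) @ u_d(xd)
    return np.concatenate([top, h_d(xd)])      # drift of the concatenated state

def G(zeta):
    n = zeta.size // 2
    gx = g(zeta[:n] + zeta[n:])
    return np.vstack([gx, np.zeros_like(gx)])  # [g; 0]

# Compare the transformed dynamics with the original error dynamics.
x = np.array([0.3, -0.2])
xd = np.array([1.0, 0.5])
u = np.array([0.7])

zeta = np.concatenate([x - xd, xd])
mu = u - u_d(xd)
zeta_dot = F(zeta) + G(zeta) @ mu
e_dot_direct = f(x) + g(x) @ u - h_d(xd)
```

The first $n$ entries of `zeta_dot` match `e_dot_direct` for any input `u`, because the $g\,u_d$ term added in $F$ is cancelled by the shift $\mu = u - u_d$.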



Lipschitz continuity of $f(x(t))$ and $g(x(t))$, the fact that $f(0) = 0$, and Assumption 2 imply that $F(0) = 0$ and $F(\zeta(t))$ is Lipschitz continuous in the sense that $\|F\| \le L_F \|\zeta\|$, where $L_F \in \mathbb{R}$ is a positive constant. The optimal control problem can now be formulated as the need to design a policy $\mu(\zeta(t)) \in \Psi(\chi \times \chi)$ that minimizes the cost functional $V(\zeta(t), \mu(\zeta(t))): \chi \times \chi \times \Psi \to \mathbb{R}_+$ defined as

\[ V = \int_t^\infty r(\zeta(\rho), \mu(\rho))\, d\rho, \tag{6} \]

subject to the dynamic constraints in (4), where $\Psi(\chi \times \chi)$ is the set of admissible policies [12], and $r(\zeta, \mu) \in \mathbb{R}$ is the local cost defined as

\[ r = \zeta^T \bar{Q} \zeta + \mu^T R \mu. \tag{7} \]

In (7), $R \in \mathbb{R}^{m \times m}$ is a positive definite symmetric matrix of constants, and $\bar{Q} \in \mathbb{R}^{2n \times 2n}$ is defined as

\[ \bar{Q} = \begin{bmatrix} Q & 0_{n \times n} \\ 0_{n \times n} & 0_{n \times n} \end{bmatrix}, \tag{8} \]

where $Q \in \mathbb{R}^{n \times n}$ is a positive definite symmetric matrix of constants with the minimum eigenvalue $\lambda_{\min}\{Q\}$, and $0_{n \times n} \in \mathbb{R}^{n \times n}$ is a matrix of zeros.
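To make the structure of the local cost concrete, the following sketch assembles the block matrix in (8) and evaluates (7) for made-up numbers. The values of `Q`, `R`, the state, and the policy are assumptions for illustration only.

```python
import numpy as np

def build_Q_bar(Q):
    """Embed the n x n error weight Q in the 2n x 2n block matrix of (8)."""
    n = Q.shape[0]
    Z = np.zeros((n, n))
    return np.block([[Q, Z], [Z, Z]])

def local_cost(zeta, mu, Q_bar, R):
    """Local cost r = zeta^T Q_bar zeta + mu^T R mu from (7)."""
    return float(zeta @ Q_bar @ zeta + mu @ R @ mu)

# Hypothetical weights and signals.
Q = np.diag([2.0, 1.0])
R = np.array([[0.5]])
Q_bar = build_Q_bar(Q)

e = np.array([1.0, -1.0])
xd = np.array([3.0, 4.0])
mu = np.array([2.0])

r = local_cost(np.concatenate([e, xd]), mu, Q_bar, R)   # 2 + 1 + 0.5 * 4 = 5.0
min_eig = np.linalg.eigvalsh(Q_bar).min()               # 0: Q_bar is only semidefinite
```

The desired trajectory enters `zeta` but is not penalized, so the cost depends on `xd` only through the dynamics. The zero eigenvalues of `Q_bar` show that the local cost is positive semidefinite in the concatenated state even though `Q` itself is positive definite.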



III. Approximate solution 



Assuming that the minimizing policy exists, the HJB equation for the optimal control problem can be written as

\[ H^* = V^*_\zeta (F + G\mu^*) + r(\zeta, \mu^*) = 0, \tag{9} \]

where $H^*(V^*_\zeta(\zeta(t)), \zeta(t), \mu^*(\zeta(t)))$ is the Hamiltonian, $V^*_\zeta(\zeta(t)) = \frac{\partial V^*(\zeta(t))}{\partial \zeta}$, and $\mu^*(\zeta(t))$ denotes the optimal policy. For the local cost in (7) and the dynamics in (4), the optimal policy can be obtained in closed-form as [1]

\[ \mu^* = -\frac{1}{2} R^{-1} G^T V^{*T}_\zeta, \tag{10} \]

assuming that the optimal cost function $V^*(\zeta(t))$ satisfies $V^*(\zeta(t)) \in C^1$ and $V^*(0) = 0$. The following assumptions are made to facilitate the use of NNs to approximate the optimal policy and the optimal cost function.
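For reference, the closed form in (10) can be recovered from stationarity of the Hamiltonian with respect to the policy. The following is a sketch under the conditions already stated ($R$ symmetric positive definite and $V^* \in C^1$), not a reproduction of the derivation in [1]:

```latex
\begin{aligned}
\frac{\partial}{\partial \mu}\Big[ V^*_\zeta (F + G\mu) + \zeta^T \bar{Q} \zeta + \mu^T R \mu \Big]
  &= V^*_\zeta G + 2 \mu^T R = 0 \\
\Longrightarrow \quad \mu^* &= -\tfrac{1}{2} R^{-1} G^T V^{*T}_\zeta .
\end{aligned}
```

Because $R$ is positive definite, the bracketed expression is strictly convex in $\mu$, so this stationary point is its unique minimizer.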

Assumption 3. The set $\chi$ is compact. Based on the subsequent stability analysis, this assumption holds as long as the initial condition $x(0)$ is bounded. See Remark 2 in the subsequent stability analysis.



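Assumption 4. The cost function $V^*(\zeta(t))$ can be represented using a NN with $N$ neurons as

\[ V^* = W^T \sigma + \epsilon, \tag{11} \]

where $W \in \mathbb{R}^N$ is the ideal weight matrix bounded above by a known positive constant $\bar{W} \in \mathbb{R}$ in the sense that $\|W\|_2 \le \bar{W}$, $\sigma(\zeta(t)): \chi \times \chi \to \mathbb{R}^N = [\sigma_1(\zeta(t)) \ \cdots \ \sigma_N(\zeta(t))]^T$ is a bounded continuously differentiable nonlinear activation function, and $\epsilon(\zeta(t)): \chi \times \chi \to \mathbb{R}$ is the function reconstruction error such that $|\epsilon(\zeta(t))| \le \bar{\epsilon}$ and $|\epsilon'(\zeta(t))| \le \bar{\epsilon}'$, where $\bar{\epsilon} \in \mathbb{R}$ and $\bar{\epsilon}' \in \mathbb{R}$ are positive constants [24], [25].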

From (10) and (11), the optimal policy can be represented as

\[ \mu^* = -\frac{1}{2} R^{-1} G^T \left( \sigma'^T W + \epsilon'^T \right). \tag{12} \]

Based on (11) and (12), the NN approximations to the optimal cost function and the optimal policy are given by

\[ \hat{V} = \hat{W}_c^T \sigma, \qquad \hat{\mu} = -\frac{1}{2} R^{-1} G^T \sigma'^T \hat{W}_a, \tag{13} \]

where $\hat{W}_c(t) \in \mathbb{R}^N$ and $\hat{W}_a(t) \in \mathbb{R}^N$ are estimates of the ideal neural network weights $W$. The use of two separate sets of weight estimates $\hat{W}_a(t)$ and $\hat{W}_c(t)$ to approximate the same ideal weights $W$ is motivated by the fact that the Bellman error is linear with respect to the cost function weight estimates and nonlinear with respect to the policy weight estimates. Use of a separate set of weight estimates for the cost function facilitates the application of the recursive least squares technique for adaptive updates. The controller $u(x(t), t)$ is obtained from (2), (5), and (13) as

\[ u = -\frac{1}{2} R^{-1} G^T \sigma'^T \hat{W}_a + g_d^+ (h_d - f_d). \tag{14} \]

Remark 1. Similar NNs have been previously developed to 
approximate the cost function and the policy (e.g. [22], [23]).
However, the tracking error is considered as the only input to 
the NNs, which implies that the cost function is considered to 
be a function of the tracking error alone. The presence of a 
term similar to $u_d(x_d)$ in the cost function definition in the
aforementioned results makes the cost function a time-varying 
function of the tracking error. For an infinite horizon optimal 
control problem, time does not lie on a compact set, whereas 
NNs can only approximate functions on a compact domain. 
Thus, it is unclear how a NN with only the tracking error 
as the input can approximate the time-varying cost function. 
In this result, the tracking error and the desired trajectory 
both serve as inputs to the NN. This makes the controller in (14) fundamentally different, in the sense that a different HJB equation must be solved and its solution, the feedback component $\hat{\mu}(\zeta)$, is a time-varying function of the tracking error as opposed to a time-invariant function of the tracking error in results such as [22], [23]. In particular, this paper
addresses the technical obstacles that result from the non- 
stationary nature of the optimal control problem by including 
the term in the HJB equation. 



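Using the approximations $\hat{\mu}(\zeta(t))$ and $\hat{V}(\zeta(t))$ for $\mu^*(\zeta(t))$ and $V^*(\zeta(t))$ in (9), respectively, the approximate Hamiltonian $\hat{H}(\hat{V}_\zeta(\zeta(t)), \zeta(t), \hat{\mu}(\zeta(t)))$ can be obtained as

\[ \hat{H} = \hat{V}_\zeta (F + G\hat{\mu}) + r(\zeta, \hat{\mu}), \]

where $\hat{V}_\zeta(\zeta(t)) = \frac{\partial \hat{V}(\zeta(t))}{\partial \zeta}$. Using (9), the error between the approximate and the optimal Hamiltonian, called the Bellman error $\delta(\hat{V}_\zeta(\zeta(t)), \zeta(t), \hat{\mu}(\zeta(t))) \in \mathbb{R}$, is given in a measurable form by

\[ \delta = \hat{H} - H^* = \hat{V}_\zeta (F + G\hat{\mu}) + r(\zeta, \hat{\mu}). \tag{15} \]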
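The measurable form in (15) can be evaluated directly from the weight estimates. The sketch below does so for a hypothetical scalar example that is not taken from the paper: plant $\dot{x} = -x + u$, trajectory generator $\dot{x}_d = -x_d$ (so $u_d = 0$), $Q = R = 1$, and the quadratic basis $\sigma = [e^2, \ e x_d, \ x_d^2]^T$. For this linear-quadratic case the HJB equation is solved by $V^* = p e^2$ with $p = \sqrt{2} - 1$, so the Bellman error should vanish when both weight vectors equal $[p, 0, 0]^T$.

```python
import numpy as np

Q, R = 1.0, 1.0
P_STAR = np.sqrt(2.0) - 1.0   # positive root of p^2 + 2p - 1 = 0 (this example only)

def F(zeta):
    e, xd = zeta
    # Top entry: f(e + xd) - h_d(xd) + g * u_d(xd) = -e; bottom entry: h_d(xd).
    return np.array([-e, -xd])

def G(zeta):
    return np.array([1.0, 0.0])   # [g; 0] with a single input

def sigma_prime(zeta):
    """Jacobian of sigma = [e^2, e*xd, xd^2] with respect to zeta = [e, xd]."""
    e, xd = zeta
    return np.array([[2.0 * e, 0.0],
                     [xd, e],
                     [0.0, 2.0 * xd]])

def policy(zeta, W_a):
    """Approximate policy from (13) for scalar R."""
    return -0.5 / R * (G(zeta) @ sigma_prime(zeta).T @ W_a)

def bellman_error(zeta, W_c, W_a):
    """Bellman error (15) and the vector omega = sigma' (F + G mu_hat)."""
    mu = policy(zeta, W_a)
    omega = sigma_prime(zeta) @ (F(zeta) + G(zeta) * mu)
    r = Q * zeta[0] ** 2 + R * mu ** 2
    return W_c @ omega + r, omega

zeta = np.array([0.8, -1.5])
W_ideal = np.array([P_STAR, 0.0, 0.0])

delta_ideal, _ = bellman_error(zeta, W_ideal, W_ideal)
delta_off, _ = bellman_error(zeta, np.array([1.0, 0.0, 0.0]), W_ideal)
```

With the critic weight perturbed to 1 while the actor stays at the ideal value, the error evaluates to $-4 p e^2$, which shows the linear dependence of $\delta$ on the critic weights.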

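The cost function weights are updated to minimize $\int_0^t \delta^2(\rho)\, d\rho$ using a normalized least squares update law with an exponential forgetting factor as [26], [27]

\[ \dot{\hat{W}}_c = -\eta_c \Gamma \frac{\omega}{1 + \nu \omega^T \Gamma \omega} \delta, \tag{16} \]

\[ \dot{\Gamma} = -\eta_c \left( -\lambda \Gamma + \Gamma \frac{\omega \omega^T}{1 + \nu \omega^T \Gamma \omega} \Gamma \right), \tag{17} \]

where $\nu, \eta_c \in \mathbb{R}$ are positive adaptation gains, $\omega(\zeta(t), \hat{\mu}(\zeta(t))) \in \mathbb{R}^N$ is defined as $\omega = \sigma' (F + G\hat{\mu})$, and $\lambda \in (0, 1)$ is the forgetting factor for the estimation gain matrix $\Gamma(t) \in \mathbb{R}^{N \times N}$. The policy weights are updated to minimize $\delta^2$ using a gradient descent update law as

\[ \dot{\hat{W}}_a = \operatorname{proj}\left\{ -\frac{\eta_{a1}}{\sqrt{1 + \omega^T \omega}}\, \sigma' G R^{-1} G^T \sigma'^T \left( \hat{W}_a - \hat{W}_c \right) \delta - \eta_{a2} \left( \hat{W}_a - \hat{W}_c \right) \right\}, \tag{18} \]

where $\eta_{a1}, \eta_{a2} \in \mathbb{R}$ are positive adaptation gains, and $\operatorname{proj}\{\cdot\}$ is a smooth projection operator [28]. The use of a forgetting factor ensures that [26], [27]

\[ \underline{\varphi} I_{N \times N} \le \Gamma(t) \le \bar{\varphi} I_{N \times N}, \tag{19} \]

where $\bar{\varphi}, \underline{\varphi} \in \mathbb{R}$ are constants such that $0 < \underline{\varphi} < \bar{\varphi}$. Using (9), (15), and (16), an unmeasurable form of the BE can be written as

\[ \delta = -\tilde{W}_c^T \omega + \frac{1}{4} \tilde{W}_a^T \mathcal{G}_\sigma \tilde{W}_a + \frac{1}{4} \epsilon' \mathcal{G} \epsilon'^T + \frac{1}{2} W^T \sigma' \mathcal{G} \epsilon'^T - \epsilon' F, \tag{20} \]

where $\mathcal{G} = G R^{-1} G^T$ and $\mathcal{G}_\sigma = \sigma' G R^{-1} G^T \sigma'^T$. The weight estimation errors for the cost function and the policy are defined as $\tilde{W}_c(t) = W - \hat{W}_c(t)$ and $\tilde{W}_a(t) = W - \hat{W}_a(t)$, respectively. Using (20), the weight estimation error dynamics for the cost function are

\[ \dot{\tilde{W}}_c = -\eta_c \Gamma \psi \psi^T \tilde{W}_c + \eta_c \Gamma \frac{\omega}{1 + \nu \omega^T \Gamma \omega} \left( \frac{1}{4} \tilde{W}_a^T \mathcal{G}_\sigma \tilde{W}_a + \frac{1}{4} \epsilon' \mathcal{G} \epsilon'^T + \frac{1}{2} W^T \sigma' \mathcal{G} \epsilon'^T - \epsilon' F \right), \tag{21} \]

where $\psi(\zeta(t), \hat{\mu}(t), \Gamma(t)) = \frac{\omega}{\sqrt{1 + \nu \omega^T \Gamma \omega}} \in \mathbb{R}^N$ is the regressor vector. Based on (19), the regressor vector can be bounded as

\[ \|\psi\| \le \frac{1}{\sqrt{\nu \underline{\varphi}}}. \tag{22} \]

The dynamics in (21) can be regarded as a perturbed form of the nominal system

\[ \dot{\tilde{W}}_c = -\eta_c \Gamma \psi \psi^T \tilde{W}_c. \tag{23} \]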
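The update laws in (16)-(18) are continuous-time differential equations; in simulation they must be discretized. The sketch below is one possible forward-Euler discretization, written under several assumptions that are not part of the paper: the step size `dt`, the function names, and the omission of the projection operator in (18) are all choices made for illustration. The actor step follows the gradient-descent structure of (18) as read from the text and takes $\sigma' G R^{-1} G^T \sigma'^T$ as a precomputed matrix `G_sigma`.

```python
import numpy as np

def critic_step(W_c, Gamma, omega, delta, eta_c, nu, lam, dt):
    """One forward-Euler step of the normalized least squares laws (16)-(17)."""
    norm = 1.0 + nu * (omega @ Gamma @ omega)
    W_c_dot = -eta_c * (Gamma @ omega) / norm * delta
    Gamma_dot = -eta_c * (-lam * Gamma
                          + Gamma @ np.outer(omega, omega) @ Gamma / norm)
    return W_c + dt * W_c_dot, Gamma + dt * Gamma_dot

def actor_step(W_a, W_c, G_sigma, omega, delta, eta_a1, eta_a2, dt):
    """One forward-Euler step of (18); the projection is omitted in this sketch."""
    W_a_dot = (-eta_a1 / np.sqrt(1.0 + omega @ omega)
               * (G_sigma @ (W_a - W_c)) * delta
               - eta_a2 * (W_a - W_c))
    return W_a + dt * W_a_dot

def regressor(omega, Gamma, nu):
    """Normalized regressor psi = omega / sqrt(1 + nu * omega^T Gamma omega)."""
    return omega / np.sqrt(1.0 + nu * (omega @ Gamma @ omega))

# Small hand-checkable example with made-up numbers.
omega = np.array([1.0, 0.0])
W_c1, Gamma1 = critic_step(np.zeros(2), np.eye(2), omega, delta=2.0,
                           eta_c=1.0, nu=1.0, lam=0.5, dt=0.1)
W_a1 = actor_step(np.ones(2), np.zeros(2), np.eye(2), omega, delta=2.0,
                  eta_a1=1.0, eta_a2=1.0, dt=0.1)
psi = regressor(np.array([3.0, 4.0]), np.eye(2), nu=1.0)
```

In the example, the gain matrix grows along the direction that is not excited by `omega`, which is the effect of the forgetting term $\lambda \Gamma$, and the norm of `psi` stays below the bound in (22) with $\underline{\varphi} = 1$. A practical implementation would also need the projection in (18) and a check that `Gamma` remains positive definite under the chosen step size.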



It is shown in [26], [27] that (23) is globally exponentially stable if the regressor vector $\psi(\zeta(t), \hat{\mu}(\zeta(t)), \Gamma(t))$ is persistently exciting. Given (15) and (23), Theorem 4.14 in [29] can be used to show that there exists a nonautonomous function $V_c(\tilde{W}_c(t), t): \mathbb{R}^N \times [0, \infty) \to \mathbb{R}$ and positive constants $\underline{v}_c$, $\bar{v}_c$, $v_{c1}$ and $v_{c2}$ such that

\[ \underline{v}_c \|\tilde{W}_c\|^2 \le V_c \le \bar{v}_c \|\tilde{W}_c\|^2, \tag{24} \]

\[ \frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial \tilde{W}_c} \left( -\eta_c \Gamma \psi \psi^T \tilde{W}_c \right) \le -v_{c1} \|\tilde{W}_c\|^2, \tag{25} \]

\[ \left\| \frac{\partial V_c}{\partial \tilde{W}_c} \right\| \le v_{c2} \|\tilde{W}_c\|. \tag{26} \]

Using Assumptions 1, 2 and 4 and the fact that $\hat{W}_a(t)$ is bounded by projection, the following bounds are developed to aid the subsequent stability analysis:

\[ \left\| \cdots + \bar{\epsilon}' L_F \|x_d\| \right\| \le \iota_1, \quad \|\mathcal{G}_\sigma\| \le \iota_2, \quad \| \cdots \| \le \iota_3, \quad \left\| \frac{1}{2} W^T \mathcal{G}_\sigma \hat{W}_a + \frac{1}{2} \epsilon' \mathcal{G} \sigma'^T \hat{W}_a \right\| \le \iota_4, \tag{27} \]

where $\iota_1, \iota_2, \iota_3, \iota_4 \in \mathbb{R}$ are positive constants.

IV. Stability Analysis

The contribution in the previous section was the development of a transformation that enables the optimal policy and the optimal cost function to be expressed as a stationary function of $\zeta(t)$. The use of this transformation presents a challenge in the sense that the optimal cost function, which is used as the Lyapunov function for the stability analysis, is not a positive definite function of $\zeta(t)$ because the matrix $\bar{Q}$ is positive semidefinite. In this section, we address this technical obstacle by exploiting the fact that the stationary Lyapunov function $V^*(\zeta(t)): \chi \times \chi \to \mathbb{R}$ can equivalently be written as the non-stationary function $V^*(e(t), t): \chi \times [0, \infty) \to \mathbb{R}$. Specifically, the use of $V^*(\zeta(t))$ facilitates the development of the approximate optimal policy, whereas the equivalent non-stationary form $V^*(e(t), t)$ can be shown to be a positive definite and decrescent function of the tracking error $e(t)$. In the following, Lemma 1 and Lemma 2 are used to prove that, written as a nonautonomous function, $V^*(e(t), t)$ is positive definite, and hence, a Lyapunov candidate. Theorem 1 then states the main result of the paper.

Lemma 1. Let $D \subset \mathbb{R}^n$ contain the origin, and let $(x, t) \in D \times \mathbb{R}_+$. Any uniformly bounded, uniformly continuous, positive definite nonautonomous function $\Xi(x, t): D \times \mathbb{R}_+ \to \mathbb{R}_+$ is decrescent in $D$.

Proof: Uniform boundedness of $\Xi(x, t)$ implies that $\forall (x, t) \in D \times \mathbb{R}_+$, $\sup_{t \in \mathbb{R}_+} \{\Xi(x, t)\}$ exists and is unique. Let the function $\alpha(x): D \to \mathbb{R}_+$ be defined as

\[ \alpha(x) = \sup_{t \in \mathbb{R}_+} \{\Xi(x, t)\} \tag{28} \]

with the property

\[ \Xi(x, t) \le \alpha(x), \ \forall t \in \mathbb{R}_+. \tag{29} \]

Uniform continuity of $\Xi(x, t)$ implies that $\forall \varepsilon > 0$, $\exists \delta > 0$ such that $\forall (x, t), (y, t) \in D \times \mathbb{R}_+$,

\[ d_{D \times \mathbb{R}_+}((x, t), (y, t)) < \delta \implies d_{\mathbb{R}}(\Xi(x, t), \Xi(y, t)) < \varepsilon, \tag{30} \]

where $d_M(\cdot, \cdot)$ denotes the standard Euclidean metric on the metric space $M$. By the definition of $d_M(\cdot, \cdot)$,

\[ d_{D \times \mathbb{R}_+}((x, t), (y, t)) = d_D(x, y). \tag{31} \]

From (30) and (31),

\[ d_D(x, y) < \delta \implies |\Xi(x, t) - \Xi(y, t)| < \varepsilon. \tag{32} \]

Given the fact that $\Xi(x, t)$ is a positive function, (32) implies $\Xi(x, t) < \Xi(y, t) + \varepsilon$ and $\Xi(y, t) < \Xi(x, t) + \varepsilon$, which from (28) implies $\alpha(x) \le \alpha(y) + \varepsilon$ and $\alpha(y) \le \alpha(x) + \varepsilon$, and hence, from (32),

\[ d_D(x, y) < \delta \implies |\alpha(x) - \alpha(y)| \le \varepsilon. \tag{33} \]

Since $\Xi(x, t)$ is positive definite, (28) can be used to conclude

\[ \alpha(0) = 0. \tag{34} \]

From (29), (32)-(34), the function $\Xi(x, t)$ is bounded above by a uniformly continuous positive definite function, and hence, is decrescent in $D$. ■
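The construction in the proof of Lemma 1 can be illustrated numerically. The sketch below uses a hypothetical function $\Xi(x, t) = x^2 (1 + 0.5 \sin t)$ on $D = [-2, 2]$, chosen only because its supremum over time, $\alpha(x) = 1.5 x^2$, is known in closed form; the grid sizes are likewise arbitrary.

```python
import numpy as np

def xi(x, t):
    """Hypothetical positive definite, uniformly bounded function on D = [-2, 2]."""
    return x ** 2 * (1.0 + 0.5 * np.sin(t))

xs = np.linspace(-2.0, 2.0, 81)            # grid over D, includes x = 0
ts = np.linspace(0.0, 2.0 * np.pi, 2001)   # one period of the time dependence

values = xi(xs[:, None], ts[None, :])      # shape (81, 2001)
alpha = values.max(axis=1)                 # grid approximation of sup over t, eq. (28)
```

On this grid, `alpha` matches $1.5 x^2$, vanishes at the origin as in (34), and dominates every time slice of `values` as in (29). This is exactly the time-invariant upper bound that makes the function decrescent.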

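Lemma 2. Let $B_a$ denote a closed ball around the origin with the radius $a \in \mathbb{R}_+$. The optimal cost function $V^*(e(t), t): \chi \times \mathbb{R}_+ \to \mathbb{R}$ satisfies the following properties

\[ V^*(e(t), t) \ge \underline{v}(\|e\|), \ \forall t \in \mathbb{R}_+, \tag{35a} \]
\[ V^*(0, t) = 0, \ \forall t \in \mathbb{R}_+, \tag{35b} \]
\[ V^*(e(t), t) \le \bar{v}(\|e\|), \ \forall t \in \mathbb{R}_+, \tag{35c} \]

for all $e(t) \in B_a$, where $\underline{v}(\|e\|): [0, a] \to [0, \infty)$ and $\bar{v}(\|e\|): [0, a] \to [0, \infty)$ are class $\mathcal{K}$ functions, and $B_a \subset \chi$.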

Proof: a) By the definition of $V^*(e(t), t)$ in (6), and $\bar{Q}$ in (8),

\[ V^*(e(t), t) = \int_t^\infty \left( e^T(\rho) Q e(\rho) + \mu^{*T}(\rho) R \mu^*(\rho) \right) d\rho \ge V_e(e(t)), \ \forall t \in \mathbb{R}_+, \tag{36} \]

where $V_e(e(t)) = \int_t^\infty \left( e^T(\rho) Q e(\rho) \right) d\rho$ is a positive definite function. According to Lemma 4.3 in [29] there exists a class $\mathcal{K}$ function $\underline{v}(\|e\|)$ such that $\underline{v}(\|e\|) \le V_e(e(t))$, which along with (36), implies (35a).



b) Since $V^*(e(t), t)$ depends explicitly on time only through the desired trajectory $x_d(t)$, it is sufficient to prove (35b) for all $x_d(t) \in \chi$. From (36),

\[ V^*(0, x_d(t)) = \int_t^\infty \left( \mu^{*T}(\rho) R \mu^*(\rho) \right) d\rho, \tag{37} \]

where

\[ \mu^*(t) = \arg\min_{\mu} \int_t^\infty \left( \mu^T(\rho) R \mu(\rho) \right) d\rho. \]

The policy

\[ \mu^*(t) = 0, \ \forall t \in \mathbb{R}_+ \tag{38} \]

minimizes $V^*(0, x_d(t))$. Furthermore, $V^*(0, x_d(t))$ is the cost incurred when starting with $e(t) = 0$ and following the optimal policy thereafter for any arbitrary desired trajectory $x_d(t)$ (cf. Section 3.7 of [2]). Substituting $x(0) = x_d(0)$, $\mu(0) = 0$ and (5) in (4), $\dot{e}(0) = 0$. Thus, when starting from $e(t) = 0$, the optimal policy in (38) satisfies the dynamic constraints in (4). Substituting (38) into (37), the optimal cost is $V^*(0, x_d(t)) = 0$, $\forall x_d(t) \in \chi$, which implies (35b).

c) Admissibility of the optimal policy implies that $V^*(e(t), t)$ is bounded for all bounded $\zeta(t)$, which along with Assumption 3 implies that $V^*(e(t), t)$ is uniformly bounded. From Assumptions 1, 2 and 4 and the fact that $\hat{W}_a(t)$ is bounded by projection, the time derivative of $V^*(e(t), t)$ along the trajectories of (4) with the control $\hat{\mu}(\zeta(t))$ is bounded. Thus $V^*(e(t), t)$ is uniformly continuous on $\chi$. Using Lemma 1 and (35a) and (35b), there exists a positive definite function $\alpha(e(t))$ such that $V^*(e(t), t) \le \alpha(e(t))$, $\forall t \in \mathbb{R}_+$. According to Lemma 4.3 in [29] there exists a class $\mathcal{K}$ function $\bar{v}(\|e\|)$ such that $\alpha(e(t)) \le \bar{v}(\|e\|)$, which implies (35c). ■



Theorem 1. Provided that the sufficient conditions 

e'Lp I r] c v C 2(p 

Vol > r\ a \t1l-i H — ■—- + rjali2l-3 




(39) 



are satisfied, Assumptions 1-4 hold, and the regressor vector $\psi(\zeta(t), \hat{\mu}(\zeta(t)), \Gamma(t))$ is persistently exciting, the controller in (14) with update laws in (16)-(18) guarantees UUB tracking of the desired trajectory and UUB convergence of the policy $\hat{\mu}(\zeta(t))$ to the optimal policy $\mu^*(\zeta(t))$.



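Proof: Consider the function $V_L(e(t), \tilde{W}_c(t), \tilde{W}_a(t), t): \chi \times \mathbb{R}^N \times \mathbb{R}^N \times [0, \infty) \to \mathbb{R}$ defined as

\[ V_L = V^* + V_c + \frac{1}{2} \tilde{W}_a^T \tilde{W}_a. \]

Using Lemma 2 and (24),

\[ \underline{v}_l(\|Z\|) \le V_L \le \bar{v}_l(\|Z\|) \tag{40} \]

for all $Z(t) \in B_b$, where $Z = \begin{bmatrix} e^T & \tilde{W}_c^T & \tilde{W}_a^T \end{bmatrix}^T \in \mathcal{Z} \subset \chi \times \mathbb{R}^{2N}$, $\underline{v}_l(\|Z\|): [0, b] \to [0, \infty)$ and $\bar{v}_l(\|Z\|): [0, b] \to [0, \infty)$ are class $\mathcal{K}$ functions, and $B_b \subset \mathcal{Z}$ denotes a ball of radius $b \in \mathbb{R}_+$ around the origin. The time derivative of $V_L(Z(t), t)$ is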



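\[ \dot{V}_L = V^*_\zeta F + V^*_\zeta G \hat{\mu} + \frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial \tilde{W}_c} \dot{\tilde{W}}_c - \tilde{W}_a^T \dot{\hat{W}}_a. \]

Using (21) and the facts that, from (9), $V^*_\zeta F = -V^*_\zeta G \mu^* - r(\zeta, \mu^*)$ and, from (10), $V^*_\zeta G = -2 \mu^{*T} R$, yields

\[ \dot{V}_L = -e^T Q e + \mu^{*T} R \mu^* - 2 \mu^{*T} R \hat{\mu} + \frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial \tilde{W}_c} \left( -\eta_c \Gamma \psi \psi^T \tilde{W}_c \right) - \tilde{W}_a^T \dot{\hat{W}}_a + \frac{\partial V_c}{\partial \tilde{W}_c} \eta_c \Gamma \frac{\omega}{1 + \nu \omega^T \Gamma \omega} \left( \hat{\mu}^T R \hat{\mu} - \mu^{*T} R \mu^* + W^T \sigma' G (\hat{\mu} - \mu^*) - \epsilon' (F + G \mu^*) \right). \tag{41} \]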

Using (18), (20) and the bounds in (25)-(27), the Lyapunov derivative in (41) can be bounded above as



where, 

K, 



V L < -K e 2 \\e\\ 
+ K e \\e\\ 

{Q} 



W 2 



Va2 



W a 



K 



w,. 



(42) 



2 I /I7(^ 



f?ali2i3 



Wcl - ?7alt2t3 



2 I 



r\a\L2l3 



ValL-3t-2e'L F 
VcVc2^P 



H + i3 (^?alt2 (^3 + ^l) + 7? a 2) , 



X = t 4 + Valil^Z- 

Provided the sufficient conditions in (39) are satisfied, Lemma 4.3 in [29] along with completion of the squares on $\|e\|$ and

"W B ' 



in (jUJ) yields 

V L < -vi i 



|Z||),V||Z||>t B >0, 



(43) 



A". 



where t6 - «r ^5R^ + 2^ + ^J- and «l(ll^(*)ll) : 
$[0, b] \to [0, \infty)$ is a class $\mathcal{K}$ function. Using (40), (43), and Theorem 4.18 in [29], $Z(t)$ is UUB. ■

Remark 2. If $\|Z(0)\| \ge \iota_5$, and if the gain conditions in (39) are satisfied, then $\dot{V}_L(Z(0), 0) < 0$. Thus, $V_L(Z(t), t)$ is decreasing at $t = 0$. Thus, $Z(t) \in \mathcal{L}_\infty$, and hence, $\zeta(t) \in \mathcal{L}_\infty$ at $t = 0^+$. Thus all the conditions of Theorem 1 are satisfied at $t = 0^+$. As a result, $V_L(Z(t), t)$ is decreasing at $t = 0^+$. By induction, $\|Z(0)\| \ge \iota_5 \implies V_L(Z(t), t) \le V_L(Z(0), 0)$, $\forall t \in \mathbb{R}_+$. Thus, from (40), $\|e(t)\| \le \|Z(t)\| \le \underline{v}_l^{-1}(\bar{v}_l(\|Z(0)\|))$, which implies $\|x(t)\| \le \underline{v}_l^{-1}(\bar{v}_l(\|Z(0)\|)) + \|x_d(t)\|$, $\forall t \in \mathbb{R}_+$. If $\|Z(0)\| < \iota_5$ then (40) and (43) can be used to determine that $\underline{v}_l(\|Z\|) \le V_L(Z(t), t) \le \bar{v}_l(\iota_5)$, $\forall t \in \mathbb{R}_+$. As a result, $\|Z\| \le \underline{v}_l^{-1}(\bar{v}_l(\iota_5))$, which implies $\|x(t)\| \le \underline{v}_l^{-1}(\bar{v}_l(\iota_5)) + \|x_d(t)\|$, $\forall t \in \mathbb{R}_+$. Define a positive constant $\bar{x} \in \mathbb{R}$ as $\bar{x} = \underline{v}_l^{-1}(\bar{v}_l(\max(\|Z(0)\|, \iota_5))) + \|x_d(0)\|$. This relieves Assumption 3 in the sense that the compact set $\chi \subset \mathbb{R}^n$ that contains the system trajectories $x(t)$, $\forall t \in \mathbb{R}_+$ is given by $\chi = \{x(t) \in \mathbb{R}^n \mid \|x(t)\| \le \bar{x}\}$.
Remark 3. The gain condition on $Q$ in (39) appears to be a restriction on the cost function that one can choose. However, as the number of neurons in the NN approximation (11) increases, the error bound $\bar{\epsilon}'$ reduces. Thus, given any positive definite matrix $Q$ one can find a number $N$ such that with $N$ or more hidden layer neurons, the gain condition in (39) is satisfied. Similar conditions can be
found in related literature (cf. lHU). 

V. Conclusion 

An ADP-based approach using the policy evaluation (Critic) 
and policy improvement (Actor) architecture is presented to 
approximately solve the infinite horizon optimal tracking prob- 
lem for control affine nonlinear systems with quadratic cost. 
The problem was solved by transforming the system to convert 
the tracking problem that has a non-stationary cost function, 
into a regulation problem that has a stationary cost function. 
The UUB tracking and estimation result was established using 
Lyapunov analysis for nonautonomous systems. Like other 
ADP-based results, this result hinges on the system states 
being persistently excited. Furthermore, the control policy in 
(14) requires exact model knowledge. A model-free solution to
the tracking problem, that relaxes the persistence of excitation 
condition, will be the focus of future research. 